Abstract

Online social networks such as Facebook, Twitter and Gowalla allow people to communicate and interact across borders. In
the past years online social networks have become increasingly important for studying the behavior of individuals, group
formation, and the emergence of online societies. Here we focus on characterizing the average growth of online
social networks and try to understand which processes may lie behind seemingly long-range temporally correlated
collective behavior. In agreement with recent findings, but in contrast to Gibrat's law of proportionate growth, we find
scaling in the average growth rate and its standard deviation. The exponents for Renren and Twitter, however, deviate
significantly in certain important aspects from those found in many social and economic systems. Whereas independent methods
suggest no significant temporally long-range correlated behavior for Renren and Twitter, a scaling analysis of the
standard deviation does suggest long-range temporally correlated growth in Gowalla. However, we demonstrate that
seemingly long-range temporal correlations in the growth of online social networks, such as in Gowalla, can be explained by
a decomposition into temporally and spatially independent growth processes with a large variety of entry rates. Our analysis
thus suggests that temporally or spatially correlated behavior does not play a major role in the growth of online social
networks.
Introduction

Online social networks (OSNs) have become increasingly 
important as they allow us to interact across any geographical 
scale. Communication networks, transport networks and OSNs 
are often interconnected and interdependent. This opens up great
economic and social opportunities but can also involve considerable
risks such as cascading breakdowns [1]. The study of OSNs is
of importance for understanding the behavior of individuals,
groups and societies. Hence, particular types of growth in social,
economic and other networked systems have attracted a lot of
attention in the past years [2-8].

Gibrat's law states that both the average growth rate and the 
standard deviation of the growth rate of a given observable are 
constant and independent of the specific value of the observable 
[9]. However, this empirical law, originally observed in economic
systems, has been challenged by many socio-economic studies
[10,11], notably very recently [4,5,8].

Any social growth dynamics is expected to depend on social
factors such as gender, age, social status and so forth. Unfortunately,
available datasets that comprise such information are
typically too small to investigate emergent scaling or large-scale 
collective behavior. In this paper, we focus on the population 
growth dynamics of three large OSNs. Our datasets do not resolve 
individual social factors but their size allows for studying scaling 
and long-range correlations, both temporally and spatially. 

We find evidence for certain scaling laws in the growth rate and 
the variance, although for Renren and Twitter the exponents 
characterizing fluctuations are found to deviate from those that 
have been reported previously for social and economic systems. 



These deviations carry important information about the growth of
online social systems. In particular, we find that the relative
numbers of registered users at different locations increase almost
independently of each other, both temporally and spatially. This
contrasts with the behavior of offline growth in many social and
economic systems, where growth is a long-range correlated process
and thus a collective phenomenon. Even for Gowalla, where scaling
indicates long-range correlated growth, a decomposition into
independent growth processes reveals the seemingly long-range
collective behavior to be a mere artifact of the large variability
of entry rates [12].

Data 

We analyze three OSN datasets. The first OSN, Renren (rr),
often referred to as the "Chinese Facebook", is one of the largest
online social networks in China. The dataset covers about
1,000,000 users in the period from January 2006 to December
2010 (60 months) with online interactions from over 10,000
registered locations.

The second OSN dataset comes from a subset of Twitter (tw),
a microblogging online social service based in the United States. It
covers more than 250,000 members between August 2006 and
September 2010 (50 months) from about 9,000 locations.

The third OSN, Gowalla (gw), was an online check-in social
service in the United States, launched in 2007 and closed in 2012.
Users were able to check in at certain locations, referred to as Spots,
either through a mobile phone application or Gowalla's mobile
website. Among other things, checking in allowed for the dropping
or swapping of virtual items. The dataset covers 21 months (from
February 2009 to October 2010) with around 200,000 members
from about 5,000 locations.

Figure 1. Scaling in average growth rate and standard deviation. Both $\langle r(S_0) \rangle$ and $\sigma(r|S_0)$ as a function of the initial population size $S_0$
exhibit a power law, $\langle r(S_0) \rangle \sim S_0^{-\alpha}$, $\sigma(r|S_0) \sim S_0^{-\beta}$. Renren: $\langle r(S_0) \rangle \sim S_0^{-0.07}$ ($R^2 = 0.326$), $\sigma(r|S_0) \sim S_0^{-0.45}$ ($R^2 = 0.526$), that is, $\alpha_{rr} = 0.07$, $\beta_{rr} = 0.45$,
Twitter: $\alpha_{tw} = 0.04$ ($R^2 = 0.142$), $\beta_{tw} = 0.37$ ($R^2 = 0.531$), Gowalla: $\alpha_{gw} = 0.09$ ($R^2 = 0.343$), $\beta_{gw} = 0.35$ ($R^2 = 0.625$). All values are obtained from MLE.
Bootstrapping suggests 95% confidence for $\alpha > 0$ (violation of Gibrat's law), and for $\beta_{gw} < 0.5$ (suggesting long-range correlations). No statistical
significance is found for $\beta < 0.5$ for Renren and Twitter. Vertical lines indicate 5% marks (insets).
doi:10.1371/journal.pone.0100023.g001

We acquired the first two datasets by crawling user profiles
from the web sites Renren.com [13] and Twitter.com [14]
through their APIs. We only crawled user profiles that are
publicly available. The Gowalla dataset was obtained from a data
source [15] shared by other researchers. Due to the tremendous size
of OSNs, we only acquired a sampled subset of each OSN. To
eliminate sample bias we deployed the Breadth First Search (BFS)
bias correction procedure by Kurant et al. [16].

For these three datasets we define a population at a location $l$ (an
integer ID) at time $t$ as the set of all users with home
location $l$. The spatial resolution of a location is the city
code, associated with the administrative area (i.e. the city name) of
the user's home location. For Renren and Twitter we assume the
registered location of the user to be the user's home location. For
Gowalla we assume the most visited location to be the home location.
For the spatial analysis we assigned GPS coordinates
(via the Google Maps Application Programming Interface (API)) to
each location $l$, and calculated the distance between two locations
from their GPS coordinates. The estimated GPS coordinates of a
user's home location may thus be incorrect for a certain fraction of
a given population. This, however, is not expected to alter any of the
conclusions made in this article.

Figure 2. Complementary cumulative relative fluctuation function. The ccrff, equation (5), as a function of the initial population size $S_0$. For
Renren and Twitter the ccrff is well fitted by a shifted power law $\sim (S_0 - C)^{-\nu}$, with $C$ a constant: (a) Result for Renren: $\nu_{rr} = 1.828$ ($R^2 = 0.997$). (b)
Result for Twitter: $\nu_{tw} = 1.324$ ($R^2 = 0.995$). (c) For Gowalla the ccrff is bimodal with a cutoff point at $S^* \approx 93$ (obtained from MLE, see Methods): the
left part is well fitted by an exponential and the right part is in good agreement with a power law decay, $\mathrm{ccrff}(S_0) \sim e^{-\mu S_0}$ for $S_0 < S^*$, and $\mathrm{ccrff}(S_0) \sim S_0^{-\nu}$ for $S_0 > S^*$.
Fit exponents $\mu_{gw} = 0.009$ ($R^2 = 0.991$), and $\nu_{gw} = 0.13$ ($R^2 = 0.966$).
doi:10.1371/journal.pone.0100023.g002

Results 

Here, we investigate the mean growth rate and its fluctuation in
OSN populations and ask how these observables
depend on the initial population size.

We denote the population size, i.e. the number of users with
home location index $1 \le l \le l_{\max}$ at time $0 \le t \le T$, by $S_l(t)$.
Following Refs. [17-19] we define the logarithmic growth rate $r$
between times $t_0$ and $t_1$ ($t_0 < t_1 \le T$) as

$$r(S_0) = \ln \frac{S_1}{S_0} \tag{1}$$

where $S_0 = S_l(t_0)$ and $S_1 = S_l(t_1)$ are the population sizes at a
location $l$ at the two different time points $t_0$ and $t_1$ [5].

To characterize fluctuations, we study the average growth rate
$\langle r(S_0) \rangle$ and the standard deviation

$$\sigma(r|S_0) = \sqrt{\langle r(S_0)^2 \rangle - \langle r(S_0) \rangle^2} \tag{2}$$

as a function of the initial population size $S_0$, see Figure 1. In other
words, the average growth rate $\langle r(S_0) \rangle$ corresponds to only those
online populations with size at least $S_0$ until time $t_0$. The
conditional standard deviation of the growth rate $\sigma(r|S_0)$ for those
populations expresses the statistical spread or fluctuation of growth
among populations with $S_0$. Both quantities show a power law
dependence on the initial population size,
$$\langle r(S_0) \rangle \sim S_0^{-\alpha} \quad \text{and} \quad \sigma(r|S_0) \sim S_0^{-\beta} \tag{3}$$

with positive exponents $\alpha, \beta > 0$ ($P < 0.05$), which suggests a
deviation from the independence of Gibrat's law that would imply
$\alpha_{\mathrm{Gibrat}} = \beta_{\mathrm{Gibrat}} = 0$. Scale-invariant growth instead of Gibrat's
proportionate growth has been reported for economic systems
such as firms [10] and countries ($\beta = 0.15 - 0.18$) [20], research
and development expenditures at universities ($\beta \approx 0.25$) [21],
scientific output ($\beta = 0.28 - 0.40$) [22], and more recently for city
population growth ($\beta = 0.19 - 0.27$) [4] and online communities
($\beta = 0.15 - 0.22$) [5].

Figure 3. Gowalla: Temporal short- and long-term correlations in the population growth rate. Short-term correlations for $S_0 < S^*$ (log-lin
plots), $C(\tau) \sim e^{-\gamma \tau}$, and long-term correlations for $S_0 > S^*$ (log-log plots), $C(\tau) \sim \tau^{-\delta}$. Fits using MLE suggest $\gamma = 0.13$ ($R^2 = 0.992$), $\delta = 0.73$
($R^2 = 0.955$); log-log scaling is used for determining the coefficient of determination for the power law, and log-linear scaling for the exponential.
doi:10.1371/journal.pone.0100023.g003
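As an illustration of how the quantities in equations (2) and (3) can be estimated, the following minimal Python sketch groups growth rates by initial population size and reads off a scaling exponent. It uses an ordinary least-squares fit on log-log axes as a simple stand-in for the MLE and bootstrapping used in this article; the function names and the toy numbers are hypothetical.

```python
import math
from collections import defaultdict

def conditional_stats(pairs):
    """Group (S0, r) pairs by S0 and return {S0: (mean of r, std of r)}.

    The standard deviation is sqrt(<r^2> - <r>^2), as in equation (2).
    """
    groups = defaultdict(list)
    for s0, r in pairs:
        groups[s0].append(r)
    stats = {}
    for s0, rs in groups.items():
        mean = sum(rs) / len(rs)
        mean_sq = sum(r * r for r in rs) / len(rs)
        # max(..., 0.0) guards against tiny negative values from rounding
        stats[s0] = (mean, math.sqrt(max(mean_sq - mean * mean, 0.0)))
    return stats

def fit_power_law_exponent(xs, ys):
    """Least-squares slope of log(y) versus log(x); returns b in y ~ x**(-b).

    Assumption: plain least squares, not the MLE used in the article.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    sxx = sum((a - mx) ** 2 for a in lx)
    return -sxy / sxx

# Hypothetical toy data: two initial sizes with two growth rates each.
pairs = [(10, 0.2), (10, 0.4), (100, 0.1), (100, 0.1)]
stats = conditional_stats(pairs)

# Exact power law sigma = 2 * S0**(-0.45); the fit should recover 0.45.
sizes = [1, 10, 100, 1000]
beta = fit_power_law_exponent(sizes, [2 * s ** -0.45 for s in sizes])
```

In practice the grouping would be done per exponential bin rather than per exact value of $S_0$ (see Methods).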

The range of $\beta$ for Renren and Twitter is in agreement with
the previously reported exponents. However, in contrast to
the previous work mentioned above, our analysis (employing
maximum likelihood estimation (MLE) and bootstrapping) does not
suggest significant deviations from $\beta = 0.5$, which would indicate
uncorrelated growth. In contrast, for Gowalla we find $\beta$
significantly smaller than $0.5$ ($P < 0.05$).

Figure 4. Gowalla: Decomposition of the growth into independent short-term correlated population growth processes. ACF
for the three data sets according to the superposition scheme explained
in the text. The power law exponents from fitting $C(\tau) \sim \tau^{-\delta_{\mathrm{sur}}}$ are obtained
from the decomposition via surrogate data (sur). Best fits from MLE:
$\delta_{\mathrm{sur}} = 0.59$ ($R^2 = 0.986$).
doi:10.1371/journal.pone.0100023.g004

Second, we find the range of exponents for the average growth
rate for all studied online social networks significantly above $\alpha = 0$
($P < 0.05$), indicating a violation of Gibrat's proportionate growth,
which is in agreement with social and economic systems
[4,5,10,20-22].

The average growth rate $\langle r(S_0) \rangle$ and the conditional standard
deviation of the growth rate $\sigma(r|S_0)$ allow for direct comparison
with the literature for other social systems and Gibrat's law but are
only averages. As suggested by studies of certain assets in
economic systems, the distribution of the variance can often
expose important information that cannot be seen in averages
[23].

Since for a given $S_0$ there is only a single value of $\sigma(r|S_0)$ (see
Fig. 1), we ask what the relative variation of $\sigma(r|S_0)$ is across all
values of $\sigma(r|S_0)$ that occur in a given dataset. We thus focus on
the relative fluctuation function (rff),

$$\mathrm{rff}(S_0) = \frac{\sigma(r|S_0)}{\sum_{S_0'} \sigma(r|S_0')} \tag{4}$$

as a function of $S_0$. Specifically, we study the complementary
cumulative relative fluctuation function (ccrff), which is given by
the complement of the integrated rff,

$$\mathrm{ccrff}(S_0) = 1 - \sum_{S_0' < S_0} \mathrm{rff}(S_0'). \tag{5}$$

We chose the ccrff representation because it shows scaling, where
present, more clearly than the rff and thus better exposes different
(scaling) regimes. The ccrff is obtained by collecting all locations
with a given value of $S_0$ using exponential binning (see Fig. 2 and
Methods).
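Equations (4) and (5) can be sketched in a few lines of Python. The snippet below is only an illustration under the assumption that one value of $\sigma(r|S_0)$ per (binned) $S_0$ is already available as a dictionary; the function names and numbers are hypothetical.

```python
def rff(sigma_by_s0):
    """Relative fluctuation function, equation (4).

    sigma_by_s0 maps each initial size S0 to sigma(r|S0); every value is
    divided by the sum of sigma over all S0 in the dataset.
    """
    total = sum(sigma_by_s0.values())
    return {s0: sigma / total for s0, sigma in sigma_by_s0.items()}

def ccrff(sigma_by_s0):
    """Complementary cumulative rff, equation (5).

    For each S0 this is 1 minus the sum of rff over all strictly smaller S0'.
    """
    rel = rff(sigma_by_s0)
    result = {}
    cumulative = 0.0
    for s0 in sorted(rel):
        result[s0] = 1.0 - cumulative  # only S0' < S0 have been added so far
        cumulative += rel[s0]
    return result

# Hypothetical standard deviations for four initial sizes.
sigma = {1: 0.4, 2: 0.3, 4: 0.2, 8: 0.1}
cc = ccrff(sigma)
```

By construction the ccrff starts at 1 for the smallest $S_0$ and decreases monotonically, which is what makes a change of regime easier to see than in the rff itself.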

In contrast to Renren and Twitter, where we find no significant
bimodality, for Gowalla the ccrff as a function of $S_0$ exhibits a
remarkable bimodal behavior.

Figure 5. Spatial independence of the population growth rates. The mean correlation coefficient $\langle c \rangle$ of the population growth rate as a
function of geographical distance (log-log plot), (a) for Renren: plateau at about $\langle c_{rr} \rangle \approx 0.80$, (b) for Twitter: $\langle c_{tw} \rangle \approx 0.73$, (c) for Gowalla: $\langle c_{gw} \rangle \approx 0.76$.
doi:10.1371/journal.pone.0100023.g005



For Gowalla, Figure 2C suggests a bimodal distribution of
standard deviations, characterized by an exponential decay that is
followed by a power law,

$$\mathrm{ccrff}(S_0) \sim \begin{cases} e^{-\mu S_0} & (S_0 < S^*) \\ S_0^{-\nu} & (S_0 > S^*) \end{cases} \tag{6}$$

where $S^* = 93$ for Gowalla marks the crossover point (determined from
MLE, see Methods). MLE suggests that the power law decay is
characterized by the exponent $\nu_{gw} = 0.13$ ($R^2 = 0.966$).

Gowalla: Correlations in the growth rate 

The above findings suggest considering two groups of locations:
one group with initial population size $S_0 < S^*$, and the other
with initial population size $S_0 > S^*$. We study the monthly
population growth rates for each location and calculate their
autocorrelation function (ACF) [24-26]. For $S_0 < S^*$ the ensemble
averaged ACF exhibits an exponential decay, $C(\tau) \sim e^{-\gamma \tau}$,
indicating that the population growth is short-term correlated,
see Figure 3A. We obtain the exponent $\gamma = 0.13$ ($R^2 = 0.992$ from
MLE), which is equivalent to a correlation time constant of about
two weeks.

In contrast, for $S_0 > S^*$ the ACF is well described by a power
law, $C(\tau) \sim \tau^{-\delta}$, with power law exponent $\delta = 0.13$ ($R^2 = 0.955$
from MLE), see Figure 3B.

This is consistent with long-term correlations characterized by
$\sum_{\tau=-\infty}^{\infty} C(\tau) = \infty$, see [27] and references therein.

Superposition model 

Seemingly long-range correlations can often be explained by a
finite set of independent processes whose superposition accounts
for the algebraic decay in the ACF, and the divergence of its
infinite sum. In 1979 van der Ziel established that an ensemble of
uncoupled short-range correlated stochastic oscillators is sufficient
for explaining long-range correlations in their superposition, if and
only if the time constants of the mixed processes are sufficiently
broadly distributed [12]. More recently, it has been shown that a
superposition of Poisson processes, together with circadian activity,
very likely accounts for many scaling laws of human activity
patterns [28]. Here, as the growth rates are broadly distributed, we
follow this spirit by considering a superposition of populations and
surrogate time series derived from these, see Methods.
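The mechanism can be made concrete with a small numerical sketch. Assuming, purely for illustration, that the decay rates $\gamma$ of the individual exponential ACFs are distributed with a density proportional to $\gamma^{\delta - 1}$ over many decades, the mixture $\int e^{-\gamma \tau} \gamma^{\delta - 1} d\gamma$ decays approximately as $\tau^{-\delta}$. The density, the range of rates and the function names below are assumptions and are not taken from the data of this article.

```python
import math

def mixture_acf(tau, delta, n=4000, log10_min=-6.0, log10_max=2.0):
    """Weighted average of exp(-gamma * tau) over log-spaced decay rates.

    The rates are assumed to have density proportional to gamma**(delta - 1).
    On a grid that is uniform in log(gamma) this corresponds to a weight of
    gamma**delta per grid point.
    """
    num = 0.0
    den = 0.0
    for k in range(n):
        gamma = 10.0 ** (log10_min + (log10_max - log10_min) * k / (n - 1))
        weight = gamma ** delta
        num += weight * math.exp(-gamma * tau)
        den += weight
    return num / den

def local_exponent(tau_a, tau_b, delta):
    """Negative log-log slope of the mixture ACF between two time lags."""
    ca = mixture_acf(tau_a, delta)
    cb = mixture_acf(tau_b, delta)
    return -(math.log(cb) - math.log(ca)) / (math.log(tau_b) - math.log(tau_a))

# Each component is short-range correlated, yet the mixture decays roughly
# like tau**(-0.5) between lag 1 and lag 100.
slope = local_exponent(1.0, 100.0, delta=0.5)
```

The power law only holds for lags well inside the range spanned by the inverse decay rates, which is why a broad distribution of time constants is required.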

For Gowalla, the population growth of the superposition ensemble,
obtained from a random selection of populations with $S_0 < S^*$,
results in the occurrence of seemingly long-term correlations like
those found for locations with $S_0 > S^*$. The exponent $\delta_{\mathrm{sur}}$ for the
surrogate superpositions (sur) is obtained from fitting the superposition
ensemble averaged ACF by MLE (see Methods),
$\delta_{\mathrm{sur}} = 0.59$ ($R^2 = 0.986$; $\delta = 0.13$), see Figure 4.

This suggests that the seemingly long-term correlated population
growth found for locations with $S_0 > S^*$ results from
superpositions of short-term correlated growing populations.

Figure 6. Selection method for $t_0$. The number of locations with
growing populations ($S(t_1) > S(t_0)$) as a function of time. We account
for the tradeoff between a large number of populated locations and a
large time series length $t_1 - t_0$ by choosing $t_0$ close to the maximum, cf.
Methods. This results in (a) $t_0 := 35$ for Renren, (b) $t_0 := 36$ for Twitter,
and (c) $t_0 := 14$ for Gowalla.
doi:10.1371/journal.pone.0100023.g006

Spatial dependence 

To study geographical factors we investigate correlations of the
population growth rates $r_i$ and $r_j$ between different places $i$ and $j$ [29,30].
We therefore study Pearson's correlation coefficient

$$c_{ij} = \frac{\langle (r_i - \langle r_i \rangle)(r_j - \langle r_j \rangle) \rangle}{\sigma_i \sigma_j} \tag{7}$$

where $\sigma_{i,j} = \sqrt{\langle r_{i,j}^2 \rangle - \langle r_{i,j} \rangle^2}$ is the standard deviation of $r_i$ and $r_j$,
respectively.

We investigate the monthly population growth rates and
Pearson's correlation coefficient between a pair of locations as a
function of the geographic distance between them. Figure 5 shows
Pearson's correlation coefficients for the three data sets. The
average correlation $\langle c \rangle$ is found at a level of about 0.7 to 0.8,
effectively independent of the geographic distance. The high value
of the cross correlation agrees well with the plausible assumption
that individuals join online social networks collectively but
independently of the geographic distance to each other.
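A pairwise analysis of this kind can be sketched as follows. The snippet computes Pearson's correlation coefficient of two growth-rate series and the distance between two GPS positions. The article does not specify how the distance was computed from the coordinates, so the haversine (great-circle) formula used here is an assumption, as are the function names.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return cov / (sx * sy)

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between two points given in degrees.

    Assumption: a spherical Earth with a mean radius of 6371 km.
    """
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))

# Hypothetical growth-rate series of two locations that move together.
c = pearson([0.1, 0.2, 0.3], [0.2, 0.4, 0.6])
# Two points on the equator on opposite sides of the globe.
d = haversine_km(0.0, 0.0, 0.0, 180.0)
```

Averaging such coefficients over all pairs of locations within a distance bin would give a curve of the type shown in Figure 5.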

Discussion 

We find scaling in the population growth rate and variance in 
online social networks. Our results suggest that the population 
growth in online social networks is neither significantly determined 
by population size [31] nor by spatial factors. The results deviate 
from Gibrat's law as previously found in many social and 
economic systems. The seemingly long-term correlated growth 
behavior for Gowalla suggested by scaling in the standard 
deviation is explained by a simple decomposition into short-term 
correlated population growth with broadly distributed growth 
rates. Our method may help to interpret (seemingly) long-range
correlations in the growth of large heterogeneous (online) social and
economic systems. Seemingly collective behavior in online social 
systems may result from the high variability of loners' actions and 
not from correlated collective behavior. 

Methods 

Ethics statement 

We use the APIs provided by Renren.com and Twitter.com
for data collection from these two websites. The acquisition of the
Renren and Twitter datasets is in accordance with the websites'
terms of service.

Data availability statement

We use three datasets in this article. The Renren and Twitter
datasets are available on request. Requests can be sent to the Computer
Networks Group at the University of Göttingen via email
(net@cs.uni-goettingen.de). The Gowalla dataset was obtained from a
data source [15] shared by other researchers and is available online;
it can be downloaded from snap.stanford.edu.

Exponential binning 

Fitting of the average, the standard deviation and the ccrff is performed
with exponential binning, in which the bins are evenly distributed
on a logarithmic scale. Specifically, the beginning of each bin is
$b_j = \lfloor c R^j \rfloor$, exponentially increasing in $j$, with constants $c$ and
$R > 1$, so that bins have size $b_{j+1} - b_j \approx b_j (R - 1)$. We use
exponential binning for both Figure 1 (with $c = 1$, $R = 2$) and
Figure 2 (with $c = 1$, $R = 1.2$).
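The bin construction can be sketched in Python as follows. This is a minimal illustration of the rule $b_j = \lfloor c R^j \rfloor$; dropping repeated bin starts (which occur for small $j$ when $R$ is close to 1) and the helper names are assumptions.

```python
import math

def exponential_bin_edges(s_max, c=1.0, ratio=2.0):
    """Bin starts b_j = floor(c * ratio**j), j = 0, 1, ..., not exceeding s_max.

    Repeated values are dropped so that every bin is non-empty on integers.
    """
    if ratio <= 1.0:
        raise ValueError("ratio must be larger than 1")
    edges = []
    j = 0
    while True:
        b = math.floor(c * ratio ** j)
        if b > s_max:
            break
        if not edges or b > edges[-1]:
            edges.append(b)
        j += 1
    return edges

def bin_index(value, edges):
    """Index of the bin whose start is the largest edge not exceeding value.

    Returns -1 if the value lies below the first bin.
    """
    index = -1
    for i, b in enumerate(edges):
        if b <= value:
            index = i
    return index

edges_fig1 = exponential_bin_edges(20, c=1.0, ratio=2.0)   # R = 2
edges_fig2 = exponential_bin_edges(5, c=1.0, ratio=1.2)    # R = 1.2
```

A smaller ratio such as $R = 1.2$ gives a finer resolution at the cost of fewer locations per bin.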






Choice of $t_0$ and $t_1$

The datasets are analyzed within a time window given by the
time points $t_0$ and $t_1$. $t_1$ is chosen as the end point of the data set,
$t_1 := T$. For the choice of $t_0$ we consider two factors: the number
of populated locations in the time window and the size of the time
window. Too small a $t_0$ would lead to only a few populated
locations, whereas a large $t_0$ would reduce the width of the
window. Following the methodology of studies of human
population growth in the real world [4] and of human interaction
activities in OSNs [5], we choose $t_0$ as the time
at which the number of locations with growing populations reaches
its peak. That is, $t_0 := 35$ for Renren, $t_0 := 36$ for Twitter and
$t_0 := 14$ for Gowalla, see Figure 6.

Figure 7. The selection of $S^*$ for Gowalla. The fitting quality as a
function of $S_0$. $S^*$ is defined as the position (argmax) of the maximum,
which is $S^* = 93$ for Gowalla.
doi:10.1371/journal.pone.0100023.g007

Determination of S* 

To determine the best value for $S^*$, we fit the distribution of
standard deviations with respect to $S_0$ ranging from 0 to 300
using MLE. For each $S_0$, we calculate $R^2$ for exponential and
power-law fitting, denoted as $R^2_{\mathrm{exp}}$ and $R^2_{\mathrm{pow}}$, respectively. To
characterize the overall fitting quality (FQ) we use

$$FQ = R^2_{\mathrm{exp}} R^2_{\mathrm{pow}} \tag{8}$$

where we use log-log scaling for determining the coefficient of
determination for the power law, and log-linear scaling for the
exponential.

We choose $S^* = \arg\max_{S_0} FQ(S_0)$, where $FQ$ takes its maximum
at the value of $S^* = 93$, as shown in Figure 7.
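The scan over candidate crossover points can be sketched as follows. The snippet is a simplified illustration: it uses least-squares lines on log-linear and log-log axes instead of the MLE fits of this article, and the synthetic curve, the minimum of three points per side and the function names are assumptions.

```python
import math

def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)

def fit_quality(s_values, y_values, s_star):
    """FQ = R2_exp * R2_pow for a candidate crossover s_star, equation (8)."""
    data = list(zip(s_values, y_values))
    left = [(s, math.log(y)) for s, y in data if s < s_star]
    right = [(math.log(s), math.log(y)) for s, y in data if s >= s_star]
    if len(left) < 3 or len(right) < 3:
        return 0.0
    r2_exp = r_squared(*zip(*left))    # log-linear scaling for the exponential
    r2_pow = r_squared(*zip(*right))   # log-log scaling for the power law
    return r2_exp * r2_pow

def best_crossover(s_values, y_values, candidates):
    """Candidate with the largest fitting quality (argmax of FQ)."""
    return max(candidates, key=lambda s: fit_quality(s_values, y_values, s))

# Synthetic curve: exponential below 40, power law from 40 onwards.
s_values = list(range(1, 201))
y_values = [math.exp(-0.05 * s) if s < 40 else s ** -0.5 for s in s_values]
s_star = best_crossover(s_values, y_values, range(5, 196))
```

Using the product of the two coefficients penalizes a crossover that improves one regime at the expense of the other.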

Spatially resolved monthly growth rates 

For each location with integer ID $l$, we extract a time series
from $t_0$ to $t_1$ of the monthly population growth rate according to
equation (1) as $r_t = \ln \frac{S_{t+1}}{S_t}$, with $t_0 \le t < t_1$ being the $t$th month.

We calculate the autocorrelation function (ACF) from the time
series $r_t$ as

$$C(\tau) = \frac{\langle (r_t - \langle r \rangle)(r_{t+\tau} - \langle r \rangle) \rangle}{\sigma_r^2} \tag{9}$$

where $\tau$ is the time lag and $\sigma_r$ is the standard deviation of $r_t$.
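A minimal Python sketch of this procedure is given below. It assumes that the monthly growth rate compares each month with the following one and that the average in the numerator is taken over the available pairs at each lag; both conventions, as well as the function names, are assumptions made for illustration.

```python
import math

def monthly_growth_rates(populations):
    """Growth rates ln(S_{t+1} / S_t) from a series of monthly population sizes."""
    return [math.log(b / a) for a, b in zip(populations, populations[1:])]

def autocorrelation(series, lag):
    """Sample ACF at one lag: <(r_t - <r>)(r_{t+lag} - <r>)> / sigma_r**2."""
    n = len(series)
    if not 0 <= lag < n:
        raise ValueError("lag must satisfy 0 <= lag < len(series)")
    mean = sum(series) / n
    var = sum((r - mean) ** 2 for r in series) / n
    if var == 0:
        raise ValueError("series has zero variance")
    # Average over the n - lag pairs that are available at this lag.
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag)) / (n - lag)
    return cov / var

def ensemble_acf(all_series, max_lag):
    """ACF averaged over many locations for lag = 0, ..., max_lag."""
    return [sum(autocorrelation(s, lag) for s in all_series) / len(all_series)
            for lag in range(max_lag + 1)]

rates = monthly_growth_rates([1, 2, 4])
alternating = [1, -1, 1, -1, 1, -1]
acf = ensemble_acf([alternating, alternating], max_lag=2)
```

The ensemble average is what is fitted by an exponential or a power law in Figures 3 and 4.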

Superposition construction 

To study superpositions we select all populations at locations
with $S_0 < S^*$. The randomized surrogate data set is created by
shuffling these entries, and creating a time series from these
shuffled entries as follows.

(1) From the set of populations with $S_0 < S^*$ we randomly select
a population and add up its initial population size $S_0$, irrespective
of its location. (2) We repeat (1) until the sum exceeds $S^*$. This
results in a set of locations whose total initial population size equals
or slightly exceeds $S^*$. We call this set of locations one realization
of a superposition. (3) For each realization we study the temporal
development of the total population size of the thereafter
fixed selected locations. For each superposition we construct a time
series, that is, the population growth rates in monthly resolution,
from $t_0$ to $t_1$. For this set of time series we obtain the ensemble
averaged ACF.
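Steps (1) to (3) can be sketched as follows. The snippet is an illustration only: drawing locations without replacement, stopping as soon as the summed initial size reaches $S^*$, and the data layout (a dictionary from location ID to monthly population sizes starting at $t_0$) are assumptions, and the toy numbers are hypothetical.

```python
import math
import random

def build_superposition(small_populations, s_star, rng):
    """One realization of a superposition.

    small_populations maps a location ID to its monthly population sizes,
    starting at t0, for locations with S0 < s_star. Locations are drawn in
    random order (without replacement) until the summed initial size
    reaches s_star; their series are then added month by month.
    """
    ids = list(small_populations)
    rng.shuffle(ids)
    chosen = []
    total = 0
    for loc in ids:
        chosen.append(loc)
        total += small_populations[loc][0]
        if total >= s_star:
            break
    length = min(len(small_populations[loc]) for loc in chosen)
    summed = [sum(small_populations[loc][t] for loc in chosen)
              for t in range(length)]
    return chosen, summed

def growth_rates(series):
    """Monthly logarithmic growth rates of a population series."""
    return [math.log(b / a) for a, b in zip(series, series[1:])]

# Hypothetical toy data: three small locations observed for three months.
populations = {"a": [30, 33, 40], "b": [40, 44, 50], "c": [50, 60, 75]}
chosen, summed = build_superposition(populations, 93, random.Random(0))
rates = growth_rates(summed)
```

Repeating the construction many times with different random draws gives the set of surrogate time series whose ensemble averaged ACF is then fitted.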